library(readr)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ ggplot2 2.2.1     ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(stringr)
library(ggthemes)
library(dplyr)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(vcd)
## Loading required package: grid
library(extracat)
library(DAAG)
## Loading required package: lattice
library(forcats)
library(tibble)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(skimr)
## 
## Attaching package: 'skimr'
## The following objects are masked from 'package:dplyr':
## 
##     contains, ends_with, everything, matches, num_range, one_of,
##     starts_with
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(parcoords)

Introduction

Team

  • Manksh Gupta (mg3835): exploratory data analysis, visualization (static plots), research
  • Kathy Lin (kl2615): exploratory data analysis, visualization (static and Shiny dashboard)
  • Louis Massera (lm3287): exploratory data analysis, visualization (static plots)
  • Chong Zhao (cz2470): data processing, exploratory data analysis, visualization (static plots), writing final report

Motivation

Popular music offers a unique lens through which to study how a culture evolves. It allows us to mine insights about what preoccupies a society, what it values in its entertainment, and how its preferences change with each generation. Because an analysis of “all mainstream music” is beyond the practical scope of this class, we limit our view to only the 100 most popular songs for each of the past 50 years.

To enable our analysis, we turn to one of the world’s most popular streaming music providers, Spotify, which offers not only a comprehensive catalog of mainstream songs but also a database of proprietary data describing various auditory features of each song. Another rich source of data is the lyrics of the songs themselves, which, when combined with Spotify’s audio features, paint a vivid portrait of a changing musical landscape. Specifically, we seek to explore questions relating to the following themes:

  • Artists (most popular, longevity, most explicit, most danceable, most collaborative, etc.)
  • Words (most frequent, usage over time, relationship to audio features)
  • Audio features (evolution over time)

Data description

For this analysis, we focus primarily on the US music market and use Billboard’s Yearly Top 100 chart as our means of determining the most popular singles for each year. Billboard is a music news magazine that publishes a weekly chart of the 100 top-selling songs. This weekly chart is considered the industry standard for measuring the success of commercial music and serves as the basis for compiling the Yearly Top 100 chart. Each yearly chart runs from the first week of December to the last week of the following November. A single gains points on the yearly chart for each weekly chart it appears on, with a weekly position of 100 yielding a single point and a position of 1 yielding 100 points (source: billboardtop100of.com).
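As a rough illustration of the scoring rule described above, the sketch below totals yearly points from weekly positions. The songs and positions are made up, and the linear rule `101 - position` is our assumption: the source only states the two endpoints (position 1 earns 100 points, position 100 earns 1 point).

```r
# Hypothetical weekly chart positions for two made-up singles
weekly <- data.frame(
  song = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  position = c(1, 3, 100, 50, 51),
  stringsAsFactors = FALSE
)

# Assumed linear scoring: position 1 earns 100 points, position 100 earns 1
weekly$points <- 101 - weekly$position

# Sum the weekly points per song, then rank songs by their yearly total
yearly <- aggregate(points ~ song, data = weekly, FUN = sum)
yearly <- yearly[order(-yearly$points), ]
print(yearly)
```

Under this assumed rule, a single that charts for many weeks at middling positions can outscore one that briefly reaches the top.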

Data acquisition

Because it is not possible to obtain Billboard’s chart data and corresponding song lyrics without some form of extensive web scraping, we opted instead to use a dataset compiled by Kaylin Walker consisting of the yearly top 100 singles and their lyrics from 1965-2015. Our primary data, Spotify’s audio features, was obtained directly by querying Spotify’s API with the resulting list of top 100 singles; the Python script for this can be found here. We then merged these features back into the Billboard dataset, as detailed here.
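The merge step can be pictured as a left join that keeps every Billboard row even when no Spotify match was found. The sketch below uses base R and invented rows; the join keys (`song`, `artist`) are an assumption for illustration and may differ from the keys used in the actual script.

```r
# Hypothetical Billboard rows (one per charting single)
billboard <- data.frame(
  song = c("Song A", "Song B", "Song C"),
  artist = c("Artist 1", "Artist 2", "Artist 3"),
  year = c(1965L, 1990L, 2015L),
  stringsAsFactors = FALSE
)

# Hypothetical Spotify results; the query for "Song B" returned nothing
spotify <- data.frame(
  song = c("Song A", "Song C"),
  artist = c("Artist 1", "Artist 3"),
  energy = c(0.42, 0.87),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps unmatched Billboard rows and fills their features with NA
merged <- merge(billboard, spotify, by = c("song", "artist"), all.x = TRUE)
print(merged)
```

A join of this kind would explain why songs whose queries failed still appear in the dataset, with every Spotify column missing.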

df = read_csv('../data/billboard-spotify.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   rank = col_integer(),
##   song = col_character(),
##   artist = col_character(),
##   year = col_integer(),
##   lyrics = col_character(),
##   release_date = col_character(),
##   spotify_album_name = col_character(),
##   spotify_artist = col_character(),
##   spotify_name = col_character()
## )
## See spec(...) for full column specifications.
colnames(df)
##  [1] "rank"               "song"               "artist"            
##  [4] "year"               "lyrics"             "acousticness"      
##  [7] "danceability"       "duration_ms"        "energy"            
## [10] "explicit"           "instrumentalness"   "key"               
## [13] "liveness"           "loudness"           "mode"              
## [16] "popularity"         "release_date"       "speechiness"       
## [19] "spotify_album_name" "spotify_artist"     "spotify_name"      
## [22] "tempo"              "time_signature"     "valence"

Description of relevant features

The dataset of merged Billboard and Spotify data contains 5100 rows, one for each song, each with the following features:

  • rank: The rank of the song in its Yearly Top 100 chart, with \(1\) being the highest and \(100\) being the lowest.
  • song: The song’s name.
  • artist: The song’s artist.
  • year: The year in which the song placed in the Yearly Top 100 chart.
  • lyrics: The lyrics for that song.
  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • duration_ms: Length of the song in milliseconds.
  • energy: A perceptual measure of intensity and activity, ranging from 0.0 to 1.0. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
  • explicit: Binary variable with \(0\) indicating a non-explicit song and \(1\) indicating an explicit song.
  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  • mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
  • popularity: A measure between 0 and 100 based largely on how many times the song has been streamed; in this dataset the values range from 0 to 87.
  • speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • time_signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Source: Spotify API

Analysis of data quality

# skimr output
spotifydf <- as_tibble(df)
skimr::skim(spotifydf)
## Skim summary statistics
##  n obs: 5100 
##  n variables: 24 
## 
## Variable type: character 
##            variable missing complete    n min  max empty n_unique
##              artist       0     5100 5100   1   89     0     2473
##              lyrics     250     4850 5100  12 5758     0     4628
##        release_date     337     4763 5100   4   10     0     1612
##                song       0     5100 5100   1   65     0     4583
##  spotify_album_name     337     4763 5100   1  109     0     3217
##      spotify_artist     337     4763 5100   1   56     0     1866
##        spotify_name     337     4763 5100   1  122     0     4411
## 
## Variable type: integer 
##  variable missing complete    n   mean    sd   p0     p25 median     p75
##      rank       0     5100 5100   50.5 28.87    1   25.75   50.5   75.25
##      year       0     5100 5100 1990   14.72 1965 1977    1990   2003   
##  p100     hist
##   100 ▇▇▇▇▇▇▇▇
##  2015 ▇▇▇▇▇▇▇▇
## 
## Variable type: numeric 
##          variable missing complete    n       mean        sd          p0
##      acousticness     337     4763 5100      0.23      0.24      1.2e-05
##      danceability     337     4763 5100      0.63      0.15      0.13   
##       duration_ms     337     4763 5100 241145.63  61486.57  78200      
##            energy     337     4763 5100      0.63      0.19      0.02   
##          explicit     337     4763 5100      0.093     0.29      0      
##  instrumentalness     337     4763 5100      0.023     0.11      0      
##               key     337     4763 5100      5.27      3.57      0      
##          liveness     337     4763 5100      0.18      0.15      0.019  
##          loudness     337     4763 5100     -8.35      3.49    -25.49   
##              mode     337     4763 5100      0.69      0.46      0      
##        popularity     337     4763 5100     53.36     14.56      0      
##       speechiness     337     4763 5100      0.069     0.073     0.022  
##             tempo     337     4763 5100    119.22     27.22     51.32   
##    time_signature     337     4763 5100      3.97      0.24      0      
##           valence     337     4763 5100      0.61      0.24      0.037  
##         p25       median          p75       p100     hist
##       0.038      0.14         0.37          0.98 ▇▃▂▂▁▁▁▁
##       0.53       0.64         0.73          0.99 ▁▁▂▅▇▇▃▁
##  205680     235933       268640       1561133    ▇▃▁▁▁▁▁▁
##       0.5        0.65         0.79          1    ▁▁▂▆▇▇▇▃
##       0          0            0             1    ▇▁▁▁▁▁▁▁
##       0          5.1e-06      0.00046       0.96 ▇▁▁▁▁▁▁▁
##       2          5            8            11    ▇▃▃▃▂▆▃▅
##       0.084      0.12         0.22          1    ▇▃▂▁▁▁▁▁
##     -10.63      -7.81        -5.64         -1.1  ▁▁▁▂▅▇▇▂
##       0          1            1             1    ▃▁▁▁▁▁▁▇
##      44         55           65            87    ▁▁▂▅▇▇▆▁
##       0.032      0.041        0.066         0.74 ▇▁▁▁▁▁▁▁
##      98.99     117.99       133.17        210.75 ▁▃▇▇▅▂▁▁
##       4          4            4             5    ▁▁▁▁▁▁▇▁
##       0.42       0.63         0.81          0.99 ▂▃▅▆▇▇▇▇
#visna plot
visna(spotifydf, sort='b')

From the summaries above, we find no significant problems with the numerical features (i.e. no values whose ranges do not make sense). However, 337 songs have missing values for all Spotify features because the queries for these songs failed. The plot below shows that the missingness of these features is mostly uncorrelated with the year in which the song charted.

na_years = df %>%
    filter(is.na(acousticness)) %>%
    select(year)
ggplot(na_years, aes(x=year)) + geom_histogram(bins=51)
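A complementary way to look at the same question is to compute the share of songs with missing features per year rather than the raw count. The sketch below does this with base R on a small invented data frame; only the column names `year` and `acousticness` come from the dataset.

```r
# Hypothetical rows: two songs per year, some with missing Spotify features
songs <- data.frame(
  year = c(1965L, 1965L, 1990L, 1990L, 2015L, 2015L),
  acousticness = c(0.8, NA, 0.4, 0.5, NA, 0.1)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the fraction of songs per year with no Spotify data
missing_share <- tapply(is.na(songs$acousticness), songs$year, mean)
print(missing_share)
```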

One significant problem we noticed in the lyric text is that some lyrics were not scraped properly; for certain songs, there exists no demarcation between the final word of a line and the first word of the next line, resulting in merged words such as:

A cursory visual inspection shows that this problem does not apply to all lyrics, yet estimating the number of songs affected, much less addressing the issue, is beyond the scope of this project. Had time permitted, we would have processed the text using an NLP package to detect “out of vocabulary” words to gauge the extent of this issue.
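The out-of-vocabulary check we had in mind could look roughly like the sketch below. It is not something we ran on the data: the vocabulary and lyrics are tiny invented examples, and a real check would need a full word list from an NLP package.

```r
# Hypothetical reference vocabulary (a real one would be far larger)
vocab <- c("i", "love", "you", "baby", "tonight", "hold", "me")

# Hypothetical lyrics; the second contains a merged word, "tonightbaby"
lyrics <- c("i love you baby", "hold me tonightbaby i love you")

# Lowercase, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(lyrics), "[^a-z']+")

# Share of each song's words that do not appear in the vocabulary
oov_share <- sapply(tokens, function(words) {
  words <- words[words != ""]
  mean(!(words %in% vocab))
})
print(oov_share)
```

Songs with an unusually high share would be candidates for the line-merging problem, although slang and proper names would also be flagged.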

A more tractable problem in the data is that some songs may become popular near the end of a year and remain popular through the beginning of the following year, resulting in 201 songs with a duplicate.

sum(duplicated(df[c('artist', 'song')]))
## [1] 201

Duplicates are detected on the basis that an artist and song should comprise a unique pair, and duplicated artist-song pairs are removed from the dataset. Because the data is ordered chronologically, this means that the later record is always removed.

df = df[!(duplicated(df[c('artist', 'song')])),]
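To make the behavior of `duplicated()` concrete, the sketch below applies the same filter to three invented rows. It marks only the second and later occurrences of an artist-song pair, so the earliest record survives.

```r
# Hypothetical chart rows in chronological order; "Song A" charts in two years
charts <- data.frame(
  artist = c("Artist 1", "Artist 2", "Artist 1"),
  song = c("Song A", "Song B", "Song A"),
  year = c(1999L, 1999L, 2000L),
  stringsAsFactors = FALSE
)

# duplicated() is FALSE for the first occurrence of a pair and TRUE afterwards
is_dup <- duplicated(charts[c("artist", "song")])
deduped <- charts[!is_dup, ]
print(deduped)
```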

Below, we derive 4 additional features to aid in analysis:

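# calculate words per second
# note: strsplit() returns a single NA for a missing lyric, so songs without lyrics are counted as one word
temp = strsplit(df$lyrics, split=" ")
df['words_per_sec'] = sapply(temp, length) / (df['duration_ms'] / 1000)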

# calculate duration in minutes
df['duration_min'] = df['duration_ms'] / 1000 / 60

# create a decade column
df['decade'] = floor(df['year'] / 10) * 10

# create base artist by stripping away featured artists
df = mutate(df, artist_base = str_replace_all(artist, "\\s\\(*feat.*", ""))
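The derived features can be checked on a few invented values. The sketch below uses base R (`gsub` in place of `str_replace_all`) with made-up artist names. Unlike the code above, its words-per-second variant treats a missing lyric as missing rather than as one word, which is a deliberate deviation for illustration.

```r
# Hypothetical inputs
artist <- c("Artist X featuring Artist Y", "Artist X (feat. Artist Z)", "Artist W")
year <- c(1987L, 2004L, 2015L)
lyrics <- c("la la la la", NA, "one two three")
duration_ms <- c(2000, 180000, 1500)

# Strip everything from " feat" (optionally preceded by "(") onwards
artist_base <- gsub("\\s\\(*feat.*", "", artist)

# Round the year down to the start of its decade
decade <- floor(year / 10) * 10

# Count space-separated words, keeping NA where the lyric is missing
n_words <- ifelse(is.na(lyrics), NA_integer_, lengths(strsplit(lyrics, " ")))
words_per_sec <- n_words / (duration_ms / 1000)

print(data.frame(artist_base, decade, words_per_sec))
```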

Main analysis

We present this exploratory data analysis in three sections as outlined in the introduction: artists, words, and audio features.

Artists

In this subsection, we examine the data at the artist level to answer some basic initial questions.

top_artists = df %>%
                group_by(artist_base) %>%
                summarize(num_singles = n()) %>%
                arrange(desc(num_singles))
top_artists_30 = head(top_artists, 30)
ggplot(top_artists_30, aes(x = reorder(artist_base, num_singles), y = num_singles)) +
    geom_col() + 
    coord_flip() +
    xlab('Artist') +
    ylab('Number of yearly top 100 singles from 1965-2015') +
    labs(title = 'Who are the most popular artists of the past 50 years?')

most_explicit_artists = df %>%
                group_by(artist_base) %>%
                summarize(explicitness = sum(explicit)) %>%
                arrange(desc(explicitness))

most_explicit_artists = head(most_explicit_artists, 30)
ggplot(most_explicit_artists, aes(x = reorder(artist_base, explicitness), y = explicitness)) +
    geom_col() + 
    coord_flip() +
    xlab('Artist') +
    ylab('Number of yearly top 100 explicit singles from 1965-2015') +
    labs(title = 'Who are the most popular explicit artists of the past 50 years?')

To measure the explicitness of each artist, we initially considered averaging the binary 0/1 explicit attribute of each song by artist. However, because many artists have only 1 single in the Yearly Top 100 throughout their career, this results in many artists having an average explicitness of 1. Thus, we fall back to simply counting the number of explicit singles per artist. Interestingly, only the top 3 most explicit artists (Eminem, Ludacris, and Drake) are also among the top 30 most popular artists. Eminem and Ludacris are famously prolific and explicit; in fact, we can see that all 15 and 14, respectively, of their top singles are explicit. More generally, we also notice that the vast majority of these singles are from the hip hop and rap genres.
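The difference between the two measures is easy to see on a toy example. In the sketch below (invented artists and flags), an artist with a single explicit hit gets a perfect average but only a count of 1, while an artist with three explicit hits out of four ranks lower on the average and higher on the count.

```r
# Hypothetical singles: artist "A" has four hits, artist "B" has one
songs <- data.frame(
  artist = c("A", "A", "A", "A", "B"),
  explicit = c(1, 1, 0, 1, 1)
)

# Averaging favors one-hit artists whose only single is explicit
mean_explicit <- tapply(songs$explicit, songs$artist, mean)

# Counting rewards artists with many explicit singles
count_explicit <- tapply(songs$explicit, songs$artist, sum)

print(rbind(mean_explicit, count_explicit))
```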

most_featuring_artists = df %>%
                    mutate(is_collab = str_detect(artist, 'feat')) %>%
                    group_by(artist_base) %>%
                    summarize(num_collaborations = sum(is_collab)) %>%
                    arrange(desc(num_collaborations))
most_featuring_artists = head(most_featuring_artists, 20)

p1 = ggplot(most_featuring_artists, aes(x = reorder(artist_base, num_collaborations),
                                              y = num_collaborations)) +
    geom_col() +
    xlab('Main artist') +
    ylab('Number of top 100 singles featuring a guest artist') +
    scale_y_continuous(breaks=1:9, labels=1:9) + 
    coord_flip()

# capture everything after "featuring" as the guest artist (NA when there is no guest)
matches = str_match(df$artist, 'featuring\\s(.*)')
matches = matches[, 2]
matches = matches[!is.na(matches)]
matches = tibble(value = matches)

featured_artists = matches %>%
                    group_by(value) %>%
                    summarize(num_features = n()) %>%
                    arrange(desc(num_features))
featured_artists = featured_artists[1:20,]

p2 = ggplot(featured_artists, aes(x = reorder(value, num_features),
                                              y = num_features)) +
    geom_col() +
    xlab('Featured artist') +
    ylab('Number of top 100 singles featured as guest') +
    scale_y_continuous(breaks=1:12, labels=1:12) + 
    coord_flip()

grid.arrange(p1, p2, ncol=2)

Many top singles are the product of a collaboration between two artists, with the resulting artist attribution of the song taking the form “X featuring Y”, where X denotes the main artist and Y denotes the guest/featured artist. Above, we see that Rihanna and Chris Brown most frequently feature other artists in their top singles, while Lil Wayne and T-Pain are the most frequent guests on other artists’ top singles. Particularly interesting is the fact that while Usher, Timbaland, Santana, and David Guetta are among the 20 most “featuring” artists, they are not among even the top 20 most “featured” artists. Conversely, T-Pain, Snoop Dogg, Nicki Minaj, and Will.I.Am are among the 20 most “featured” artists despite not being among the 20 most “featuring” artists.

collaborations = df %>%
                    mutate(is_collab = str_detect(artist, 'feat')) %>%
                    group_by(year) %>%
                    summarize(num_collaborations = sum(is_collab))
ggplot(collaborations, aes(x = year, y = num_collaborations)) +
    geom_line() +
    geom_point() + 
    xlab('Year') +
    ylab('Number of top 100 singles featuring a guest artist')

More interesting still is the rising trend in artist collaborations over time that begins during the 1990s and takes off dramatically before plateauing in the late 2000s. Additional comments about this trend are presented in the executive summary.

nunique_artists_year = df %>% 
                        group_by(year) %>%
                        summarize(nunique = n_distinct(artist_base))

options(repr.plot.width = 16, repr.plot.height = 6)
ggplot(nunique_artists_year, aes(x = year, y = nunique)) + 
    geom_line() + 
    geom_point() + 
    ylab('Number of unique artists')

The plot above shows artist diversity over time, with diversity defined not on the basis of race but as the number of unique artists with top 100 singles each year. The trend is mixed: the most diverse year falls in the early 1970s while the least diverse years are 2009 and 2010, yet the pattern remains noisy enough that more years are needed before drawing a more confident conclusion.

top_artists = df %>%
                group_by(artist_base) %>%
                # note: longevity is 0 when all of an artist's hits fall in one year, making hits_per_year Inf
                summarize(num_singles = n(), earliest_hit = min(year), latest_hit = max(year),
                          longevity = latest_hit - earliest_hit, hits_per_year = num_singles / longevity) %>%
                arrange(desc(longevity))
top_artists_30 = head(top_artists, 30)

ggplot(top_artists_30) +
    geom_segment(aes(x=earliest_hit, xend=latest_hit, y=reorder(artist_base, longevity),
                     yend=reorder(artist_base, longevity), color=num_singles), size=5) +
    geom_text(aes(x=latest_hit + 3, y=reorder(artist_base, longevity), label=paste(longevity, 'years'))) +
    ggtitle('Career spans of most timeless artists')

Besides revealing which artists had the most top 100 hit singles, one of the more interesting aspects of the data is that it allows us to see which artists had the greatest career longevity, defined simply as the number of years between an artist’s earliest charting hit and most recent charting hit. A caveat here is that because the later of any duplicate singles was removed from the dataset, some artists’ career spans appear shorter by a year than if duplicate singles had not been removed.

Above, we observe not only which artists had the greatest career longevity but also the time period during which each artist was popular, with lighter colors signifying a greater number of total hit singles. Curiously, while Madonna and Mariah Carey have charted most frequently in the top 100, several artists (Santana, Cher, The Isley Brothers, and Aretha Franklin, among others) have had longer charting careers.

This raises a natural question: does there exist a relationship between the number of singles an artist produces per year and their career span? Below, we observe that no artist has produced an average of greater than 2 hit singles a year and achieved a career span exceeding a decade. In fact, all artists generating on average more than 2.5 hit singles per year have career spans of less than 5 years, with the most significant “flash in the pan” artists being those that averaged 4 or more hit singles per year.

ggplot(top_artists, aes(x = longevity, y = hits_per_year, color = num_singles)) + geom_point(alpha=0.25)
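The two quantities in this scatter plot can be reproduced on a toy example. The sketch below uses base R and invented hit years. It also shows an edge case worth keeping in mind when reading the plot: an artist whose hits all fall within one year has a longevity of 0, so the ratio is infinite.

```r
# Hypothetical hit years for three made-up artists
hits <- data.frame(
  artist = c("A", "A", "A", "B", "B", "C"),
  year = c(1970L, 1975L, 1980L, 2001L, 2002L, 1999L)
)

# Number of hits and span between first and last hit, per artist
num_singles <- tapply(hits$year, hits$artist, length)
longevity <- tapply(hits$year, hits$artist, function(y) max(y) - min(y))

# Artist "C" has a single hit, so longevity is 0 and the ratio is Inf
hits_per_year <- num_singles / longevity
print(hits_per_year)
```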

However, being a “flash in the pan” artist with multiple hit singles over a short span of time might still be preferable to the fate of the majority of charting artists. In the cumulative relative frequency histogram below, we see that over half of all artists only ever generate a single hit, and over three-quarters of artists generate at most 2 hits.

p1 = ggplot(top_artists) + geom_histogram(aes(x=num_singles))
p2 = ggplot(top_artists) + geom_histogram(aes(x=num_singles, y=cumsum(..count../sum(..count..))))

grid.arrange(p1, p2, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
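The cumulative relative frequencies can also be computed directly, which avoids any dependence on the histogram’s bin choice. The sketch below uses invented hit counts, not the report’s data.

```r
# Hypothetical number of hit singles for ten made-up artists
num_singles <- c(1, 1, 1, 1, 1, 2, 2, 3, 5, 8)

# Count artists at each hit total, then accumulate and normalize
counts <- table(num_singles)
cum_share <- cumsum(counts) / sum(counts)
print(cum_share)
```

In this toy example, half of the artists have exactly one hit and 70% have at most two.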

Audio features

In this subsection, we turn our focus to the audio features obtained from Spotify for each song, particularly how these features relate to one another and how they change over time. Originally, we aggregated each feature by year and plotted how the mean evolves over time. Dissatisfied with the information lost by aggregating with the mean alone, we then added further aggregations such as the maximum and the minimum for each year before ultimately deciding to create box plots for each year to minimize information loss.

#Duration
spotifydf <- df%>%
                group_by(year)

ggplot(spotifydf) + geom_boxplot(aes(year, duration_min, group=year))+
    ggtitle("Duration over time")+
    ylab("Duration (in Minutes)")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

From above, we observe a gradual increase in the durations of top 100 singles that peaks at 1990 with median song lengths just shy of 5 minutes before starting on a downward trend to under 4 minutes. This trend of declining song durations coincides with a decline in the variation of song lengths. One possible explanation is that as the music industry has become increasingly competitive, artists are forced to grab their audience’s attention as quickly as possible, and as listeners gain access to an ever-expanding catalogue of songs, their attention spans are decreasing.

#Acousticness
ggplot(spotifydf) + geom_boxplot(aes(year, acousticness, group=year)) +
    ggtitle("Acousticness over time") +
    ylab("Acousticness") +
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

Top 100 songs have been trending downward in acousticness from 1965 to 2015. This trend might also be attributed to an increasingly competitive music industry; live studio musicians and indeed recording studios themselves are much more expensive in contrast to hardware synthesizers, and later, software synthesizers on which artists can create entire songs using only a laptop computer.

#Danceability
ggplot(spotifydf) + geom_boxplot(aes(year, danceability, group=year)) +
    ggtitle("Danceability over time")+
    ylab("Danceability ")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

While the median danceability of songs has remained relatively stable from 1985 to 2015, there exists a noticeable increase from 1965 to 1984.

#Energy
ggplot(spotifydf) + geom_boxplot(aes(year, energy, group=year)) +
    ggtitle("Energy over time")+
    ylab("Energy ")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

Overall, it appears that songs are becoming more energetic and busier over time, though the signal is quite noisy. Most interestingly, from 1985 onwards, the change in median energy over time even appears to be cyclical.

#Instrumentalness
ggplot(spotifydf) + geom_boxplot(aes(year, instrumentalness, group=year)) +
    ggtitle("Instrumentalness over time")+
    ylab("Instrumentalness ")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

Unsurprisingly, the median instrumentalness of top 100 singles consistently stays at or near 0; however, we can observe a noticeable thinning out of outliers, suggesting that instrumental music is becoming increasingly unpopular. Additionally, this decline in the instrumentalness of songs could be tied to the decline of song lengths; if songs are becoming shorter because of declining listener attention spans, then one is likely to also observe songs with increasingly shorter instrumental introductions and interludes as artists favor “getting to the point.”

#Liveness
ggplot(spotifydf) + geom_boxplot(aes(year, liveness, group=year)) +
    ggtitle("Liveness over time")+
    ylab("Liveness ")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

There are no discernable trends in how the distribution of songs’ liveness changes over time.

#Loudness
ggplot(spotifydf) + geom_boxplot(aes(year, loudness, group=year)) +
    ggtitle("Loudness over Time")+
    ylab("Loudness ")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

Top 100 singles are unquestionably becoming much louder. We expand on an explanation of the “Loudness Wars” in the executive summary.

#Speechiness
ggplot(spotifydf) + geom_boxplot(aes(year, speechiness, group=year)) +
    ggtitle("Speechiness over Time")+
    ylab("Speechiness ")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

Also pronounced is a dramatic increase in speechiness starting from 1990 and peaking in 2004, which may correspond with the increased popularity of rap over that period of time.

#Tempo
ggplot(spotifydf) + geom_boxplot(aes(year, tempo, group=year)) +
    ggtitle("Tempo over Time")+
    ylab("Tempo ")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

There are no discernable trends in how the distribution of songs’ tempos changes over time.

#Valence
ggplot(spotifydf) + geom_boxplot(aes(year, valence, group=year)) +
    ggtitle("Valence over Time")+
    ylab("Valence")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

There is a general trend of songs becoming increasingly less positive, perhaps reflecting the increase in cultural and economic pressure society has been experiencing.

#Verbosity
ggplot(spotifydf) + geom_boxplot(aes(year, words_per_sec, group=year)) +
    ggtitle("Verbosity over Time")+
    ylab("Words per second")+
    scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).

Corresponding to the trends involving speechiness over time, we see that the number of words heard per second in songs has increased over time. Indeed, one should expect this given the previously observed speechiness trends as it is much easier to speak quickly than to sing quickly.

keepcols <- c('year', 'acousticness', 'danceability', 'duration', 'energy', 'instrumentalness', 'liveness','loudness','speechiness', 'tempo', 'valence')

spotifydf<-spotifydf%>%
    filter(!is.na(duration_ms))%>%
    mutate(duration=duration_ms/1000/60)

spotifydf_s <- spotifydf%>%
                select(keepcols) %>%
                group_by(year)%>%
                summarize(mean_acousticness = mean(acousticness),
                        mean_danceability = mean(danceability),
                        mean_duration = mean(duration),
                        mean_energy = mean(energy),
                        mean_instrumentalness = mean(instrumentalness),
                        mean_liveness = mean(liveness),
                        mean_loudness = mean(loudness),
                        mean_speechiness = mean(speechiness),
                        mean_tempo = mean(tempo),
                        mean_valence = mean(valence)
                )
# %>%
#                 gather(key='variable', value = 'Freq', -year)
spotifydf_s$year<- factor(spotifydf_s$year, levels = unique(spotifydf_s$year))

ggparcoord(spotifydf_s, columns = 2:11, alphaLines = 0.7, groupColumn ='year',scale = 'uniminmax')+xlab("")+ylab("") + theme(axis.text.x = element_text(angle=90))

# custom function to transpose while preserving names
transpose_df <- function(df) {
  t_df <- data.table::transpose(df)
  colnames(t_df) <- rownames(df)
  rownames(t_df) <- colnames(df)
  return(t_df)
}

# using the function
dft<-transpose_df(spotifydf_s)
colnames(dft) <- dft[1, ]
dft <- dft[-1,]
dft<- dft%>%
    rownames_to_column('Variable')

ggparcoord(dft, columns = 2:52, alphaLines = 0.7, groupColumn ='Variable',scale = 'uniminmax')+xlab("")+ylab("") + theme(axis.text.x = element_text(angle=90))

# convert the year factor back to a number via its character labels
spotifydf_s$year <- as.integer(as.character(spotifydf_s$year))

rownames(spotifydf_s)<- spotifydf_s$year
## Warning: Setting row names on a tibble is deprecated.
p <- spotifydf_s%>% arrange(year) %>%
    parcoords(
        rownames = F,
        brushMode = "1D-axes",
        reorderable = T,
        queue = T,
        alpha=.8,
        color = list(colorBy = "year", colorScale = htmlwidgets::JS("d3.scale.category10()")),
        width = 1100,
        height = 500
    )
p

To examine the correlation structure of the data’s continuous features, we create a correlation matrix and visualize it with a heatmap below. Additionally, we arrange the features so that those with the most similar correlations to other features are placed near each other, with the dendrograms on the top and left showing which features are most similar to each other.

columns = c('rank', 'year', 'acousticness', 'danceability', 'duration_min', 
            'energy', 'instrumentalness', 'liveness', 'loudness', 'popularity',
            'speechiness', 'tempo', 'valence', 'words_per_sec')

df_cor = df[columns]
correlation = cor(df_cor, method='pearson', use='pairwise.complete.obs')

col = colorRampPalette(c('red', 'white', 'green'))(20)
heatmap(x = correlation, col = col, symm = TRUE)
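The base heatmap() call above draws neither a legend nor per-cell values. As a minimal sketch of one possible alternative, assuming ggplot2 (already loaded at the top of this report) and using R's built-in mtcars data as a stand-in for df, the correlation matrix can be reshaped to long form and drawn with geom_tile() and geom_text(). The feature ordering here comes from hclust() on 1 - r, which is our own choice and not necessarily the ordering heatmap() produces. Names such as cor_long and feature_order are hypothetical.

```r
library(ggplot2)

# Stand-in data: built-in mtcars columns play the role of the song features.
features <- mtcars[, c("mpg", "wt", "hp", "qsec", "disp")]
correlation <- cor(features, method = "pearson", use = "pairwise.complete.obs")

# Order features so that strongly correlated ones sit next to each other
# (assumption: 1 - r as the clustering distance).
ord <- hclust(as.dist(1 - correlation))$order
feature_order <- colnames(correlation)[ord]

# Long format: one row per cell of the correlation matrix.
cor_long <- as.data.frame(as.table(correlation), stringsAsFactors = FALSE)
names(cor_long) <- c("feature_x", "feature_y", "r")
cor_long$feature_x <- factor(cor_long$feature_x, levels = feature_order)
cor_long$feature_y <- factor(cor_long$feature_y, levels = rev(feature_order))

annotated_heatmap <- ggplot(cor_long, aes(x = feature_x, y = feature_y, fill = r)) +
    geom_tile(color = "white") +
    geom_text(aes(label = sprintf("%.2f", r)), size = 3) +
    scale_fill_gradient2(low = "red", mid = "white", high = "green",
                         midpoint = 0, limits = c(-1, 1)) +
    labs(x = NULL, y = NULL, fill = "Pearson r") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

annotated_heatmap
```

This sketch gives up the dendrograms in exchange for the legend and annotations; swapping mtcars for df[columns] should be the only change needed, assuming those columns are numeric.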

Note: The heatmap below was originally created in Python using the Seaborn visualization library. In translating the visualization to R using R’s heatmap functionality, we were able neither to produce a legend nor to annotate the grid cells individually. We therefore present this second, more ideal heatmap as an image in addition to the one created in R above.

The dendrogrammed correlation heatmap reveals some interesting aspects of the data’s correlation structure:

  • Popularity and year are weakly correlated. This makes intuitive sense given that Spotify’s audience is more likely to listen to current music, and that people who would enjoy older music might be less likely to listen to it on Spotify.
  • Loudness and year are also weakly correlated, as previously observed.
  • Loudness and energy are moderately correlated.
  • Acousticness and energy are weakly anticorrelated. This could be explained by the fact that ballads and other slow music are more likely to feature acoustic instruments.
  • Valence and danceability are weakly correlated. This could suggest that, for the most part, people are less inclined to dance to sad or angry music.
  • Words per second is weakly correlated with speechiness. Indeed, it is easier to talk fast than to sing fast.

The correlation heatmap provides guidance as to which pairs of features are worth investigating further.
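Reading candidate pairs off a heatmap by eye can be complemented by ranking them directly. The following is a minimal sketch in base R, using the built-in mtcars data as a stand-in for df; pairs_ranked is a hypothetical name, and ranking by absolute Pearson correlation is our assumption about what "worth investigating" means.

```r
# Stand-in data: four numeric columns from the built-in mtcars dataset.
features <- mtcars[, c("mpg", "wt", "hp", "qsec")]
correlation <- cor(features, method = "pearson", use = "pairwise.complete.obs")

# Take each unordered pair once by looking only at the upper triangle.
idx <- which(upper.tri(correlation), arr.ind = TRUE)
pairs_ranked <- data.frame(
    feature_a = rownames(correlation)[idx[, "row"]],
    feature_b = colnames(correlation)[idx[, "col"]],
    r = correlation[idx],
    stringsAsFactors = FALSE
)

# Strongest relationships (positive or negative) first.
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
print(pairs_ranked)
```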

p1 = ggplot(df, aes(x=acousticness, y=energy)) +
    geom_point(alpha=0.15) +
    facet_wrap(~ decade)

p2 = ggplot(df, aes(x=acousticness, y=energy, color=year)) +
    geom_point(alpha=0.5) +
    ggtitle('Acousticness vs. energy')

grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 327 rows containing missing values (geom_point).

## Warning: Removed 327 rows containing missing values (geom_point).

It appears that most songs have very low acousticness (below 0.125) and high energy (above 0.5). Outside this dense region, however, a song’s energy appears to decrease quadratically as its acousticness increases. Furthermore, these less common high-acousticness, lower-energy songs are predominantly older, with the vast majority of more recent songs occupying the low-acousticness, higher-energy region. Faceting on decade shows this more clearly: the faceted scatterplots confirm that songs are trending towards lower acousticness and higher energy with each passing decade (which was also observed in the time series box plots). Note that because the dataset spans 1965-2015, the facets for the 1960s and 2010s have roughly half as many data points as the facets for other decades.

p2 = ggplot(df, aes(x=danceability, y=valence, color=year)) +
    geom_point(alpha=0.5) + 
    ggtitle('Danceability vs valence')

p1 = ggplot(df, aes(x=danceability, y=valence)) +
    geom_point(alpha=0.15) +
    facet_wrap(~ decade)

grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 327 rows containing missing values (geom_point).

## Warning: Removed 327 rows containing missing values (geom_point).

p2 = ggplot(df, aes(x=speechiness, y=words_per_sec, color=year)) +
    geom_point(alpha=0.5) +
    ggtitle('Speechiness vs verbosity')

p1 = ggplot(df, aes(x=speechiness, y=words_per_sec)) +
    geom_point(alpha=0.15) +
    facet_wrap(~ decade)

grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 327 rows containing missing values (geom_point).

## Warning: Removed 327 rows containing missing values (geom_point).

Above, we observe that speechiness and verbosity (words per second) are indeed positively correlated, albeit moderately, and that top 100 singles have become increasingly speechy and verbose since the 1990s, when rap and hip hop first gained mainstream appeal.

p1 = ggplot(df, aes(x=energy, y=loudness, color=year)) +
    geom_point(alpha=0.5) +
    ggtitle('Energy vs. loudness')

p2 = ggplot(df, aes(x=energy, y=loudness)) +
    geom_point(alpha=0.15) +
    facet_wrap(~ decade)

grid.arrange(p2, p1, ncol = 2)
## Warning: Removed 327 rows containing missing values (geom_point).

## Warning: Removed 327 rows containing missing values (geom_point).

Words

In this subsection, we analyze the words within the lyrics of each song. In particular, given the rich set of features provided by Spotify, we can explore interesting questions such as “what are the most danceable words?” To facilitate such analysis, we use Python to tidy our dataset at the word level. Whereas the original song-level dataset contains one row per song, the word-level dataset contains one row per unique word per song. For example, if the word “dance” appears multiple times in a single song, it becomes a single row in the new dataset, but if the word “dance” appears in X songs, it becomes X rows. Each row of the word-level dataset inherits all the features of the song it belongs to (year, title, artist, audio features) and contains a new feature “count” indicating the number of times the word appears in that song. Furthermore, only words that appear in at least 10 different songs are included in the new dataset. Otherwise, if we attempted to answer “what are the most danceable words?” by averaging the danceability of the songs each word appears in, words that appear only in the single most danceable song would dominate the ranking.
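The word-level transformation was done in Python and is not reproduced in this report. As a rough sketch of the same idea in base R, on a made-up three-song data frame (the songs, lyrics, tokenization rule, and the name tidy_words are all illustrative assumptions, and the threshold is lowered from 10 songs to 2 so the toy data survives the filter):

```r
# Toy song-level data: one row per song (invented for illustration).
songs <- data.frame(
    song = c("A", "B", "C"),
    danceability = c(0.9, 0.5, 0.7),
    lyrics = c("dance dance all night", "all my love", "dance with my love"),
    stringsAsFactors = FALSE
)

# Split each song's lyrics into lower-case words.
tokens <- strsplit(tolower(songs$lyrics), "[^a-z']+")

# One row per unique word per song, carrying the song's features and a count.
word_rows <- do.call(rbind, lapply(seq_along(tokens), function(i) {
    counts <- table(tokens[[i]][tokens[[i]] != ""])
    data.frame(
        song = songs$song[i],
        danceability = songs$danceability[i],
        word = names(counts),
        count = as.integer(counts),
        stringsAsFactors = FALSE
    )
}))

# Keep only words that appear in at least min_songs different songs.
min_songs <- 2  # the report uses 10
songs_per_word <- table(word_rows$word)
keep <- names(songs_per_word)[songs_per_word >= min_songs]
tidy_words <- word_rows[word_rows$word %in% keep, ]

# "Most danceable words": average danceability of the songs containing each word.
word_means <- aggregate(danceability ~ word, data = tidy_words, FUN = mean)
word_means <- word_means[order(-word_means$danceability), ]
print(word_means)
```

Because each word contributes one row per song, table(word_rows$word) counts songs rather than occurrences, which is what the minimum-songs filter needs.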

words = read_csv('../data/tidy-words.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   rank = col_integer(),
##   song = col_character(),
##   artist = col_character(),
##   year = col_integer(),
##   release_date = col_character(),
##   spotify_album_name = col_character(),
##   spotify_artist = col_character(),
##   spotify_name = col_character(),
##   num_words = col_integer(),
##   artist_base = col_character(),
##   dummy = col_integer(),
##   word = col_character(),
##   count = col_integer()
## )
## See spec(...) for full column specifications.

Executive Summary

This section of the report succinctly describes our most interesting findings.

Popular music offers a unique lens through which to study how a culture evolves. It shows us how we as a society think, and it is one form of art that everyone enjoys as a community. It allows us to mine insights about what preoccupies a society, what it values in its entertainment, and how its preferences change with each generation. We limit our view to the 100 most popular songs for each of the past 50 years, as we feel they best define what people listened to and liked most each year.

Our most interesting findings are logged below:

Love conquers all, but unfortunately it’s not forever.

The analysis of the top 100 song lyrics from 1960 to 2014 reveals some interesting insights. The most common word until ‘year’ is “love”; in subsequent years, however, the most common word changes to “like”. Google word trends also confirms this fearful observation: according to Google, the general popularity of the word “like” overtook that of the word “love” at almost the same time as shown in our plots. We suspect that this is because “like” has developed into a filler word in everyday speech and has thus become ubiquitous. Its popularity in song lyrics is consistent with this hypothesis. But don’t worry, love still trumps like ;).

# some cool code displaying word clouds

Flash in the pan vs. steady burn

We examine which artists have had the longest careers, defined simply as the number of years between an artist’s earliest charting hit and most recent charting hit. Some interesting findings are summarized below.

Besides revealing which artists had the most top 100 hit singles, one of the more interesting aspects of the data is that it shows which artists had the greatest career longevity. A caveat here is that because the latest of any duplicate singles was removed from the dataset, some artists’ career spans appear a year shorter than they would if duplicate singles had not been removed. We observe not only which artists had the greatest career longevity but also the time period during which each artist was popular, with lighter colors signifying an artist with a greater number of total hit singles. Curiously, while Madonna and Mariah Carey have charted most frequently in the top 100, several artists (Santana, Cher, The Isley Brothers, and Aretha Franklin, among others) have had longer charting careers.

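top_artists = df %>%
                group_by(artist_base) %>%
                summarize(num_singles = n(), earliest_hit = min(year), latest_hit = max(year),
                          # longevity is 0 (so hits_per_year is Inf) when all of an artist's hits fall in one year
                          longevity = latest_hit - earliest_hit,
                          hits_per_year = num_singles / longevity) %>%
                arrange(desc(longevity))
# R indexes from 1, so rows 1 to 30 are the 30 longest careers
top_artists_30 = top_artists[1:30,]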

ggplot(top_artists_30) + geom_segment(aes(x=earliest_hit, xend=latest_hit, y=reorder(artist_base, longevity), yend=reorder(artist_base, longevity), color=num_singles), size=5) + geom_text(aes(x=latest_hit + 3, y=reorder(artist_base, longevity), label=paste(longevity, 'years'))) + ggtitle('Career spans of most timeless artists') + xlab('Year') + ylab('Name of Artist')

This raises a natural question: does there exist a relationship between the number of singles an artist produces per year and their career span? Below, we observe that no artist has averaged more than 2 hit singles a year and achieved a career span exceeding a decade. In fact, all artists generating on average more than 2.5 hit singles per year have career spans of less than 5 years, with the most significant “flash in the pan” artists being those that averaged 4 or more hit singles per year.

p1 = ggplot(top_artists, aes(x = longevity, y = hits_per_year, color = num_singles)) + geom_point(alpha=0.25)
p2 = ggplot(top_artists) + geom_histogram(aes(x=num_singles, y=cumsum(..count../sum(..count..))), bins=31)

grid.arrange(p1, p2, ncol=2)
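One detail of the hits-per-year ratio is worth spelling out: with longevity defined as latest year minus earliest year, an artist whose hits all fall in a single year has a longevity of 0, and the ratio becomes infinite. The sketch below, on invented data, shows one possible way to keep the ratio finite by counting the first and last years inclusively. The active_years column is our assumption and is not what the plots above use.

```r
# Toy chart data: one row per hit single (invented for illustration).
hits <- data.frame(
    artist_base = c("A", "A", "A", "B", "B", "C"),
    year = c(1970, 1975, 1990, 2001, 2001, 1985),
    stringsAsFactors = FALSE
)

# Per-artist summary, mirroring the columns of top_artists.
spans <- do.call(rbind, lapply(split(hits, hits$artist_base), function(d) {
    data.frame(
        artist_base = d$artist_base[1],
        num_singles = nrow(d),
        earliest_hit = min(d$year),
        latest_hit = max(d$year),
        stringsAsFactors = FALSE
    )
}))

spans$longevity <- spans$latest_hit - spans$earliest_hit

# Assumption: count both endpoint years, so a one-year career spans 1 year, not 0.
spans$active_years <- spans$longevity + 1
spans$hits_per_year <- spans$num_singles / spans$active_years
print(spans)
```

Under this definition, single-year artists stay on the plot with a finite rate, at the cost of slightly lowering everyone else's rate relative to the original ratio.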

Pay attention to me!

The music market has become increasingly competitive, with artists vying for the attention of listeners from all age groups. The rise of streaming services such as Spotify and Apple Music has made music easily accessible, but it has also exposed people to more music genres, and thus it has become harder to capture and keep listeners’ attention.

‘The loudness war (or loudness race) refers to the trend of increasing audio levels in recorded music which many critics believe reduces sound quality and listener enjoyment.’ (Wikipedia)

With the advent of the Compact Disc (CD), music is encoded in a digital format with a clearly defined maximum peak amplitude. Once the maximum amplitude of a CD is reached, loudness can be increased still further through signal processing techniques such as dynamic range compression and equalization. It has been observed and documented that since the 1990s artists have adopted the practice of making their songs louder in the hope of making them more popular and resonating more with the audience.
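To make the mechanism concrete, here is a small sketch of how compression can raise loudness without raising the peak. It applies a static, sample-by-sample hard-knee gain curve with make-up gain to a synthetic tone; real mastering compressors track a smoothed signal level with attack and release times, so this is only a simplified illustration. The threshold, ratio, and test signal are arbitrary assumptions.

```r
# Synthetic test signal: a 220 Hz tone that swells from quiet to full scale.
t <- seq(0, 1, length.out = 8000)
x <- (0.1 + 0.9 * t) * sin(2 * pi * 220 * t)
x <- x / max(abs(x))  # normalise so the peak is exactly 1 ("digital full scale")

# Static compressor: levels above the threshold are reduced by the ratio,
# then make-up gain brings a full-scale input back to a peak of 1.
compress <- function(signal, threshold = 0.3, ratio = 4) {
    level <- abs(signal)
    over <- level > threshold
    level[over] <- threshold + (level[over] - threshold) / ratio
    makeup_gain <- 1 / (threshold + (1 - threshold) / ratio)
    sign(signal) * level * makeup_gain
}

# Root-mean-square level as a crude proxy for perceived loudness.
rms <- function(signal) sqrt(mean(signal^2))

y <- compress(x)
print(c(peak_before = max(abs(x)), peak_after = max(abs(y)),
        rms_before = rms(x), rms_after = rms(y)))
```

The peak stays at the format's ceiling while the RMS level goes up, because the quiet parts of the signal are lifted towards the loud parts.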

ggplot(spotifydf) + geom_boxplot(aes(year, loudness, group=year)) +
    ggtitle("Loudness over Time")+
    ylab("Loudness ")+
    scale_x_continuous(breaks = seq(1960,2020,5))

p1 = ggplot(df, aes(x=energy, y=loudness, color=year)) +
    geom_point(alpha=0.5) +
    ggtitle('Energy vs. loudness') +xlab('Energy')

p2 = ggplot(df, aes(x=energy, y=loudness)) +
    geom_point(alpha=0.15) +
    facet_wrap(~ decade)

grid.arrange(p2, p1, ncol = 2)
## Warning: Removed 327 rows containing missing values (geom_point).

## Warning: Removed 327 rows containing missing values (geom_point).

Interestingly, we observe similar findings in our data. The plots above show the increasing loudness of songs over time, especially after the 1990s. This may also be related to radio becoming more popular.

In the first plot above, the loudness variable shows a general upward trend over time: its variability is shrinking while its median is rising, which suggests that most songs are being made ‘loud’ and that those are the ones getting more recognition.

The second plot also offers some interesting insights. We compare energy and loudness by decade. Looking closely, loudness generally increases from decade to decade.

p1 = ggplot(df, aes(x=instrumentalness, y=duration_min, color=year)) +
    geom_point(alpha=0.5)

p2 = ggplot(df, aes(x=instrumentalness, y=duration_min)) +
    geom_point(alpha=0.15) +
    facet_wrap(~ decade)

grid.arrange(p2, p1, ncol = 2)
## Warning: Removed 327 rows containing missing values (geom_point).

## Warning: Removed 327 rows containing missing values (geom_point).

That is not all; other factors support our suspicions. Instrumental music, which by design is not loud, is declining in popularity, and song durations are becoming less variable. This is also strongly related to our earlier hypothesis about capturing the attention of listeners, who have access to a larger collection of music than ever before and can easily get bored.

The decline in instrumentalness can be attributed to two causes:

  1. Shorter songs mean less time for instrumental intros or mid-song interludes. Instrumental music also tends to be longer, which makes it difficult to hold listeners’ attention.

  2. A general decline in instrumental music: there has been a large decline in the number of instrumental artists and thus in instrumental music. There is a comprehensive CNN article detailing the reasons.

Bleep Bleepin’ Bleep

Damn.

I’m sure that reference was clear? Well, Kendrick Lamar just won the Pulitzer Prize for Music. It’s the first time in the history of the prize that a rap artist has won it.

Rap music has gained increasing popularity since the 1990s. This Wikipedia article details its growing popularity and the reasons behind it. It boils down to how we have changed culturally as a society: things that were once not accepted by society are now openly accepted, and rap music is an example of that. We have culturally evolved from a more orthodox to a more open-minded society. Our data again shows evidence of this shift.

#Speechiness
ggplot(spotifydf) + geom_boxplot(aes(year, speechiness, group=year)) +
    ggtitle("Speechiness over Time")+
    ylab("Speechiness ")+
    scale_x_continuous(breaks = seq(1960,2020,5))

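# colored scatterplot to pair with the faceted one (p1 would otherwise still hold the plot from an earlier chunk)
p1 = ggplot(df, aes(x=words_per_sec, y=speechiness, color=year)) +
    geom_point(alpha=0.5)

p2 = ggplot(df, aes(x=words_per_sec, y=speechiness)) +
    geom_point(alpha=0.15) +
    facet_wrap(~ decade)

grid.arrange(p2, p1, ncol = 2)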
## Warning: Removed 327 rows containing missing values (geom_point).

## Warning: Removed 327 rows containing missing values (geom_point).

There has also been a rise in another feature of music, which may be related in part to the increasing popularity of rap music. Rap music often contains explicit words, and the rise in explicit songs is clearly visible in the plot below.

# count the explicit songs in each year (songs with a missing flag are ignored)
explicit = df %>% 
           group_by(year) %>%
           summarize(num_explicit = sum(explicit, na.rm=TRUE))

ggplot(explicit, aes(x = year, y = num_explicit)) + 
    geom_line() + 
    geom_point() + 
    ylab('Number of explicit songs')
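The chunk above plots a raw count of explicit songs per year. Since some songs have missing Spotify features and the number of missing rows may differ from year to year, a proportion among the songs that do have an explicit flag might be a fairer comparison. The sketch below shows the difference on invented data; it assumes explicit is a logical (or 0/1) column, and the name explicit_by_year is hypothetical.

```r
# Toy data: NA marks songs without Spotify features (invented for illustration).
songs <- data.frame(
    year = c(1990, 1990, 1990, 2010, 2010, 2010, 2010),
    explicit = c(FALSE, NA, TRUE, TRUE, TRUE, NA, FALSE)
)

# Only songs with a known explicit flag enter the denominator.
matched <- songs[!is.na(songs$explicit), ]

explicit_by_year <- data.frame(
    year = sort(unique(matched$year)),
    num_explicit = as.vector(tapply(matched$explicit, matched$year, sum)),
    num_matched = as.vector(tapply(matched$explicit, matched$year, length))
)
explicit_by_year$prop_explicit <-
    explicit_by_year$num_explicit / explicit_by_year$num_matched
print(explicit_by_year)
```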

Show some word clouds with explicit words below!

# TODO 

'can also copy Louis’ word cloud showing the most explicit words. the fact that "feds/cops" and some other not intrinsically explicit words show up is fascinating. will need to re-summarize how a word’s explicitness is calculated (i.e. re-hash the process for creating the "tidy words" dataset)'
## [1] "can also copy Louis’ word cloud showing the most explicit words. the fact that \"feds/cops\" and some other not intrinsically explicit words show up is fascinating. will need to re-summarize how a word’s explicitness is calculated (i.e. re-hash the process for creating the \"tidy words\" dataset)"

Odds and ends

[ Kathy’s parallel coordinates plot ]

[ Louis’ word clouds for most danceable, most energetic, etc. words ]

Conclusion

This brings us to the end of the project. Well, almost: we have some interesting interactive visualizations coming up ahead. First, however, we would like to reflect on the project, what we learned through the process, and what we would have done differently. This is more of a reflection than a conclusion; we all know conclusions are boring, and we all hate writing them!

We encountered several problems while doing the project, but we came out ahead and gained experience from them. The first major problem with the data was fused words. Word clouds formed an important part of the project, and messy word data was one of the biggest problems we had. If we had more time, we would have liked to use an NLP package to tag out-of-vocabulary words and estimate the extent of the issue. For now, we simply removed these words from the data.

Second, we would have liked to do more feature engineering: deriving subtler features from the ones we already had and seeing how these implicit features have changed over time. For instance, using the tidy words transformation and averaging the audio features by word is a very heuristic method of calculating features like “danceability” for each word. A cooler method would have been to convert each song’s lyrics to bag-of-words data and train multiple linear regressions using the lyrics to predict the song’s audio features (i.e. one regression model for danceability, one regression model for energy, etc.). We could then use the coefficients from the linear regressions to tell us which words are the most and least danceable, energetic, loud, acoustic, etc. Unfamiliarity with R as a programming tool and time restrictions are the reasons we could not finish these analyses. We leave this as a further exercise to the reader.
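For readers who take up the exercise, the bag-of-words regression idea might look roughly like the following in base R. Everything here is a toy assumption: the six "songs", their lyrics, and their danceability values are invented, and the vocabulary is kept to three words so that an ordinary lm() fit is well determined. With a realistic vocabulary there would likely be more words than songs, in which case some form of regularized regression would probably be needed instead.

```r
# Toy data: invented lyrics and danceability scores, for illustration only.
songs <- data.frame(
    lyrics = c("dance dance love", "cry cry love", "dance cry",
               "dance dance dance", "cry love love", "dance love"),
    danceability = c(0.7, 0.3, 0.5, 0.8, 0.4, 0.6),
    stringsAsFactors = FALSE
)

# Bag of words: one row per song, one column per vocabulary word, cells are counts.
tokens <- strsplit(songs$lyrics, " ", fixed = TRUE)
vocabulary <- sort(unique(unlist(tokens)))
bag_of_words <- t(vapply(
    tokens,
    function(words) tabulate(match(words, vocabulary), nbins = length(vocabulary)),
    integer(length(vocabulary))
))
colnames(bag_of_words) <- vocabulary

# One regression per audio feature; only danceability is shown here.
model_data <- data.frame(danceability = songs$danceability, bag_of_words)
fit <- lm(danceability ~ ., data = model_data)

# Larger coefficients suggest words associated with more danceable songs.
word_effects <- sort(coef(fit)[vocabulary], decreasing = TRUE)
print(word_effects)
```

Repeating the fit with energy, loudness, or acousticness as the response would give the corresponding word rankings for those features.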